!jt -t grade3
# !jt -r
Simple linear regression is a statistical method used to model the relationship between a single independent variable (predictor) and a continuous dependent variable (response). Here's a brief introduction to simple linear regression, covering its purpose, strengths, weaknesses, and the types of data it works best on:
The purpose of simple linear regression is to understand and quantify the linear relationship between two variables. It helps in predicting the value of the dependent variable based on the value of the independent variable. The model assumes a straight-line relationship between the variables and estimates the parameters (slope and intercept) that define this line.
Strengths:

- **Interpretability:** Simple linear regression results in a straightforward model represented by a linear equation (y = mx + b), making it easy to interpret and communicate.
- **Prediction:** It provides a simple method for making predictions. Once the model parameters are estimated, you can use the model to predict the dependent variable for new values of the independent variable.
- **Visualization:** The relationship between variables can be visualized easily with a scatter plot and the fitted regression line.

Weaknesses:

- **Assumption of Linearity:** Simple linear regression assumes a linear relationship between the variables, which may not be suitable for data with non-linear patterns.
- **Sensitivity to Outliers:** The model can be sensitive to outliers; a single influential data point can significantly shift the estimated parameters.
- **Assumption of Independence:** The model assumes that observations are independent. If observations are dependent, the model's performance may suffer.
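The outlier sensitivity is easy to demonstrate: adding a single extreme point to otherwise perfectly linear toy data (made up for illustration) noticeably shifts the fitted slope.

```python
import numpy as np

# Perfectly linear toy data: y = 2x + 1
x = np.arange(10, dtype=float)
y = 2 * x + 1
clean_slope = np.polyfit(x, y, 1)[0]       # slope of the least-squares fit

# Corrupt a single observation and refit
y_outlier = y.copy()
y_outlier[-1] += 50                        # one influential outlier
outlier_slope = np.polyfit(x, y_outlier, 1)[0]

print(clean_slope, outlier_slope)          # the slope moves well away from 2
```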
Simple linear regression works best when the relationship between the two variables is approximately linear, the response is continuous, observations are independent, and the data contain few extreme outliers.
In summary, simple linear regression is a valuable tool for modeling and understanding the linear relationship between two variables. Its simplicity makes it a good choice for situations where a linear model is appropriate and interpretable.
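As a sketch of what "estimating the parameters" means, the least-squares slope and intercept can be computed directly from their closed-form formulas (toy data made up for illustration):

```python
import numpy as np

# Toy data, roughly y = 2x + 1 with noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"y = {m:.2f}x + {b:.2f}")
```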
# install the opendatasets package
# !pip install opendatasets
import opendatasets as od
# download the dataset (this is a Kaggle dataset) during download you will be required to input your Kaggle username and password
od.download("https://www.kaggle.com/datasets/ramlalnaik/fuelconsumptionco2?select=FuelConsumptionCo2.csv")
Skipping, found downloaded files in "./fuelconsumptionco2" (use force=True to force download)
# an alternative approach in case that was too simple :)
import requests

def download(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
    else:
        print(f"Error: {response.status_code} - {response.reason}")

# Example usage (note: this Kaggle page requires authentication, so an
# unauthenticated request returns the HTML page rather than the CSV file)
url = "https://www.kaggle.com/datasets/ramlalnaik/fuelconsumptionco2?select=FuelConsumptionCo2.csv"
filename = "FuelConsumptionCo2.txt"
download(url, filename)
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df = pd.read_csv("/Users/davidfoutch/Desktop/CourseEra/IBM_AI/FuelConsumptionCo2.csv")
# take a look at the dataset
print(df.describe())
df.head()
| | MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|
| count | 1067.0 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 |
| mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
| std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.794510 | 3.485595 | 7.468702 | 63.372304 |
| min | 2014.0 | 1.000000 | 3.000000 | 4.600000 | 4.900000 | 4.700000 | 11.000000 | 108.000000 |
| 25% | 2014.0 | 2.000000 | 4.000000 | 10.250000 | 7.500000 | 9.000000 | 21.000000 | 207.000000 |
| 50% | 2014.0 | 3.400000 | 6.000000 | 12.600000 | 8.800000 | 10.900000 | 26.000000 | 251.000000 |
| 75% | 2014.0 | 4.300000 | 8.000000 | 15.550000 | 10.850000 | 13.350000 | 31.000000 | 294.000000 |
| max | 2014.0 | 8.400000 | 12.000000 | 30.200000 | 20.500000 | 25.800000 | 60.000000 | 488.000000 |
| MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'
# Columns to plot
columns = ['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']

# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=columns)

# Populate subplots with histograms
fig.add_trace(go.Histogram(x=df['ENGINESIZE']), row=1, col=1)
fig.add_trace(go.Histogram(x=df['CYLINDERS']), row=1, col=2)
fig.add_trace(go.Histogram(x=df['FUELCONSUMPTION_COMB']), row=2, col=1)
fig.add_trace(go.Histogram(x=df['CO2EMISSIONS']), row=2, col=2)

# Update layout
fig.update_layout(
    title='Histograms of Selected Columns',
    showlegend=False
)

# Show the plot
# fig.show()
import plotly.graph_objects as go

# Data
x_values = df['FUELCONSUMPTION_COMB']
y_values = df['CO2EMISSIONS']

# Create a scatter plot with styling
fig = go.Figure(data=go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    )
))

# Update layout
fig.update_layout(
    title='Emissions and Fuel Consumption',
    xaxis_title='FUELCONSUMPTION_COMB',
    yaxis_title='Emissions',
    template='ggplot2'
)
import plotly.graph_objects as go

# Data
x_values = df['ENGINESIZE']
y_values = df['CO2EMISSIONS']

# Create a scatter plot with styling
fig = go.Figure(data=go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    )
))

# Update layout
fig.update_layout(
    title='Emissions and Engine Size',
    xaxis_title='Engine Size',
    yaxis_title='Emissions',
    template='ggplot2'
)
In the context of simple linear regression (or regression modeling in general), splitting the data into training and test sets is crucial for evaluating the performance of the model. Here's the procedure and why it is done:

- **Data Splitting:** The dataset is divided into two subsets, with most of the data typically reserved for training.
- **Training Set:** Used to fit the model, i.e., to estimate the slope and intercept.
- **Test Set:** Held out during fitting and used only to evaluate the trained model.
- **Model Evaluation:** Metrics such as MAE, MSE, and R-squared computed on the test set measure predictive accuracy on data the model has not seen.
- **Generalization:** Good test-set performance suggests the model will generalize to new observations rather than merely memorizing the training data.
- **Avoiding Overfitting:** A model that fits the training data too closely shows a large gap between training and test performance, which the split makes visible.
- **Parameter Tuning:** Comparing held-out performance across candidate models guides choices such as which features to include.
In Python, you can use libraries like scikit-learn to split your data into training and test sets. Here's an example:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Assuming X is your feature matrix and y is your target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model and fit it on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
# Perform further evaluation (e.g., calculate metrics, visualize results)
By splitting your data into training and test sets, you can build and evaluate regression models more effectively, ensuring that they perform well on new, unseen data.
When using the train_test_split function from scikit-learn in Python, you don't need to manually create a mask. The function takes care of splitting your data into training and test sets randomly.
In this example:
- `train_test_split` automatically splits your data into training and test sets based on the specified `test_size` (the proportion of the dataset to include in the test split).
- `random_state` ensures reproducibility: setting a specific seed results in the same split every time you run the code.
- The function returns four arrays: `X_train`, `X_test`, `y_train`, and `y_test`, which you can use for training and evaluating your regression model.
So, you don't need to create a mask manually; the function handles the data splitting for you.
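For comparison, the manual mask the text alludes to would look something like this (a sketch with a made-up array, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)

# Boolean mask: each row has an 80% chance of landing in the training set
msk = rng.random(len(data)) < 0.8
train, test = data[msk], data[~msk]
print(len(train), len(test))  # roughly an 80/20 split
```

Note that with a random mask the split sizes vary from run to run, whereas `train_test_split` gives an exact `test_size` proportion.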
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Use engine size as the feature matrix and CO2 emissions as the target
X = np.asanyarray(df[['ENGINESIZE']])
y = np.asanyarray(df[['CO2EMISSIONS']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
Coefficients: [[38.99297872]]
Intercept: [126.28970217]
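As a sanity check, the printed coefficients can be plugged into the fitted line ŷ = m·x + b to predict emissions for a given engine size (assuming the dataset's emissions are in g/km):

```python
# Coefficients reported above for the ENGINESIZE model
slope, intercept = 38.99297872, 126.28970217

engine_size = 3.0  # litres
predicted_co2 = slope * engine_size + intercept
print(round(predicted_co2, 1))  # about 243.3 for a 3.0 L engine
```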
Common metrics for evaluating a regression model, with their formulas:

Coefficient of Determination (R-squared):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Mean Squared Error (MSE) or Root Mean Squared Error (RMSE):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Residual Analysis and Residual Plots: the residuals $e_i = y_i - \hat{y}_i$ plotted against the fitted values should show no systematic pattern if the linear model is appropriate.

Adjusted R-squared (with $p$ predictors):

$$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}$$

F-statistic:

$$F = \frac{\text{SSR} / p}{\text{SSE} / (n - p - 1)}$$

where SSR is the regression (explained) sum of squares and SSE is the residual sum of squares.
When interpreting these metrics, it's important to consider the specific characteristics of your data and the goals of your analysis. Additionally, using a combination of metrics provides a more comprehensive evaluation of model performance.
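These metrics are straightforward to compute by hand; a minimal NumPy sketch on made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

mae = np.mean(np.abs(y_pred - y_true))             # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)              # mean squared error
ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                           # coefficient of determination
print(mae, mse, r2)
```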
from sklearn.metrics import r2_score
# X_test = np.asanyarray(X_test)
# y_test = np.asanyarray(y_test)
y_pred = model.predict(X_test)
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_pred - y_test)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_pred - y_test) ** 2))
print("R2-score: %.2f" % r2_score(y_test, y_pred))
Mean absolute error: 24.10
Residual sum of squares (MSE): 985.94
R2-score: 0.76
x_values = df['ENGINESIZE']
y_values = df['CO2EMISSIONS']
# Create a scatter plot with styling
scatter_trace = go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    ),
    name='Scatter Plot'
)

# Create a regression line trace (least-squares fit via np.polyfit)
regression_line_trace = go.Scatter(
    x=x_values,
    y=np.polyval(np.polyfit(x_values, y_values, 1), x_values),
    mode='lines',
    line=dict(color='green', width=2, dash='dash'),
    name='Regression Line'
)

# Create the figure
fig = go.Figure(data=[scatter_trace, regression_line_trace])

# Update layout
fig.update_layout(
    title='Emissions and Engine Size',
    xaxis_title='Engine Size',
    yaxis_title='Emissions',
    template='ggplot2'
)